Required libraries¶

In [1]:
import pandas as pd
import numpy as np
import dalex as dx
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split, GridSearchCV, cross_val_score
from sklearn_extra.cluster import KMedoids

import warnings
warnings.filterwarnings('ignore')

Data load¶

In [2]:
df = pd.read_csv('DATA.csv')

Preliminary Analysis¶

In [3]:
df.head()
Out[3]:
Patient_ID Systemic Illness Rectal Pain Sore Throat Penile Oedema Oral Lesions Solitary Lesion Swollen Tonsils HIV Infection Sexually Transmitted Infection MonkeyPox
0 P0 None False True True True False True False False Negative
1 P1 Fever True False True True False False True False Positive
2 P2 Fever False True True False False False True False Positive
3 P3 None True False False False True True True False Positive
4 P4 Swollen Lymph Nodes True True True False False True True False Positive
In [4]:
print(f'There are {df.shape[0]} patients in the dataset')
There are 25000 patients in the dataset
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Patient_ID                      25000 non-null  object
 1   Systemic Illness                25000 non-null  object
 2   Rectal Pain                     25000 non-null  bool  
 3   Sore Throat                     25000 non-null  bool  
 4   Penile Oedema                   25000 non-null  bool  
 5   Oral Lesions                    25000 non-null  bool  
 6   Solitary Lesion                 25000 non-null  bool  
 7   Swollen Tonsils                 25000 non-null  bool  
 8   HIV Infection                   25000 non-null  bool  
 9   Sexually Transmitted Infection  25000 non-null  bool  
 10  MonkeyPox                       25000 non-null  object
dtypes: bool(8), object(3)
memory usage: 781.4+ KB
In [6]:
df.describe().T
Out[6]:
count unique top freq
Patient_ID 25000 25000 P0 1
Systemic Illness 25000 4 Fever 6382
Rectal Pain 25000 2 False 12655
Sore Throat 25000 2 True 12554
Penile Oedema 25000 2 True 12612
Oral Lesions 25000 2 False 12514
Solitary Lesion 25000 2 True 12527
Swollen Tonsils 25000 2 True 12533
HIV Infection 25000 2 True 12584
Sexually Transmitted Infection 25000 2 False 12554
MonkeyPox 25000 2 Positive 15909

Missing values check¶

As you can see, we don't have missing values. We also can't meaningfully deduplicate: almost all features are binary, and without numerical features there is no way to distinguish a true duplicate from a different patient who happens to have the same parameters¶
In [7]:
print(f'Missing values: {df.isna().sum().sum()}')  # isnull() is an alias of isna(), so summing both would double-count
Missing values: 0
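A quick pigeonhole argument shows why exact duplicate rows are inevitable here, and therefore not evidence of data errors: the feature space is tiny compared to the number of patients.¶

```python
# One 4-level feature, 8 boolean symptoms and a binary target give at most
# 4 * 2**8 * 2 = 2048 distinct rows -- far fewer than 25,000 patients,
# so df.duplicated() is bound to flag many rows even in perfectly clean data.
n_combinations = 4 * 2**8 * 2
print(n_combinations)  # 2048
```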

Let's drop useless feature Patient_ID¶

In [8]:
df.drop('Patient_ID', axis=1, inplace=True)

Exploratory Data Analysis¶

Distribution of our target feature (MonkeyPox) as proportions¶

In [9]:
df['MonkeyPox'].value_counts(normalize=True)
Out[9]:
Positive    0.63636
Negative    0.36364
Name: MonkeyPox, dtype: float64

And let's visualize it¶

We can see that the data is imbalanced: there are almost twice as many people with Monkey Pox as without it¶
In [10]:
palette = sns.color_palette('pastel')
fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(y='MonkeyPox', data=df, palette=palette, ax=ax)
ax.set_title('Distribution of Monkey Pox', fontsize=15);

Charts showing differences in Systemic Illness between people with and without Monkey Pox¶

We can already see that Fever and Swollen Lymph Nodes dominate among people with Monkey Pox, while among uninfected people these illnesses are noticeably rarer than the others¶
In [11]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

illness_dist_pos = df[df['MonkeyPox'] == 'Positive']['Systemic Illness'].value_counts().sort_index()
ax[0].pie(x=illness_dist_pos.values, 
          labels=illness_dist_pos.index, 
          colors=palette, 
          autopct='%.0f%%')
ax[0].set_title('Distribution of systemic illness between infected', fontsize=15);

illness_dist_neg = df[df['MonkeyPox'] == 'Negative']['Systemic Illness'].value_counts().sort_index()
ax[1].pie(x=illness_dist_neg.values, 
          labels=illness_dist_neg.index, 
          colors=palette, 
          autopct='%.0f%%')
ax[1].set_title('Distribution of systemic illness between not infected', fontsize=15);

Distribution of all other binary features with respect to the target feature¶

In [12]:
fig, ax = plt.subplots(2, 4, figsize=(20, 8))
ax = ax.flatten()
for idx, feature in enumerate(df.columns.drop(['MonkeyPox', 'Systemic Illness'])):
    sns.countplot(x=feature, hue='MonkeyPox', data=df, palette=palette, ax=ax[idx])
    ax[idx].set_title(feature, fontsize=20)
    ax[idx].set(ylabel=None, xlabel=None)
    ax[idx].tick_params(axis='both', labelsize=12)

Feature importance¶

First, I use OrdinalEncoder to convert the data to numerical codes¶

In [13]:
df_enc = pd.DataFrame(OrdinalEncoder().fit_transform(df), columns=df.columns)
df_enc.head()
Out[13]:
Systemic Illness Rectal Pain Sore Throat Penile Oedema Oral Lesions Solitary Lesion Swollen Tonsils HIV Infection Sexually Transmitted Infection MonkeyPox
0 2.0 0.0 1.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 1.0
2 0.0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0
3 2.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0
4 3.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 1.0
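OrdinalEncoder assigns integer codes in the alphabetical order of each column's categories, which is worth keeping in mind when reading the encoded table. A minimal sketch on the three Systemic Illness values visible above (the real column has a fourth level):¶

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

demo = pd.DataFrame({'Systemic Illness': ['Fever', 'None', 'Swollen Lymph Nodes']})
enc = OrdinalEncoder().fit(demo)

# categories_ lists each column's levels in the order of their integer codes
print(enc.categories_[0])           # ['Fever' 'None' 'Swollen Lymph Nodes']
print(enc.transform(demo).ravel())  # [0. 1. 2.]
```

Keeping the fitted encoder around (instead of calling fit_transform inline) also makes it possible to map codes back to labels later via inverse_transform.¶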

Separate the features from the target, create a Random Forest and fit it¶

In [14]:
x, y = df_enc.drop('MonkeyPox', axis=1), df_enc['MonkeyPox']
In [15]:
rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=17)
rfc.fit(x, y)
Out[15]:
RandomForestClassifier(n_jobs=-1, random_state=17)

Create Dalex Explainer¶

In [16]:
exp = dx.Explainer(rfc, x, y)
Preparation of a new explainer is initiated

  -> data              : 25000 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 25000 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x7f4ca78e9040> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.636, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.971, mean = 5.13e-05, max = 0.963
  -> model_info        : package sklearn

A new explainer has been created!

Plot showing the global impact of each feature. It is based on permutation importance: the average drop in model performance caused by removing a variable is its importance score, and the greater the drop, the higher the score. Since actually removing a feature (and retraining) is computationally expensive, we simply shuffle its values, which simulates the removal¶
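The same idea is available directly in scikit-learn as permutation_importance; a minimal sketch on synthetic data (the toy dataset here is purely for illustration):¶

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X_toy, y_toy = make_classification(n_samples=500, n_features=4,
                                   n_informative=2, random_state=17)
model = RandomForestClassifier(random_state=17).fit(X_toy, y_toy)

# Shuffle each column n_repeats times and record the mean drop in score
result = permutation_importance(model, X_toy, y_toy,
                                n_repeats=5, random_state=17)
print(result.importances_mean)  # one score per feature
```

dalex computes its scores against the explainer's loss function (1 − AUC by default for classification) rather than accuracy, but the shuffling mechanism is the same.¶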

It can be noticed that Systemic Illness has by far the biggest impact, followed by Rectal Pain, HIV Infection, Sexually Transmitted Infection, etc.¶
In [17]:
exp.model_parts().plot()

Partial dependence plots show the average prediction for each value of a feature.¶

For example, Systemic Illness has 4 unique values (0, 1, 2, 3 in our encoded dataset). For the values 0 and 3 the average output is about 0.75, i.e. closer to 1 (MonkeyPox: Positive), while for 1 and 2 it is closer to 0¶

It can be seen that the lines vary more for the most important features from the previous plot (especially Systemic Illness)¶
Conversely, for features such as Swollen Tonsils and Solitary Lesion, the difference between the values at point 0 (False) and point 1 (True) is insignificant¶
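Under the hood such a profile is just an average over ceteris-paribus predictions. A minimal hand-rolled sketch (pd_curve is an illustrative helper, not the dalex API):¶

```python
import numpy as np

def pd_curve(model, X, feature, grid):
    """Average predicted probability with `feature` forced to each grid value."""
    means = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value  # ceteris paribus: only this column changes
        means.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(means)
```

Called as pd_curve(rfc, x, 'Systemic Illness', [0, 1, 2, 3]), it should reproduce the shape of the dalex profile for that feature.¶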
In [18]:
exp.model_profile().plot()
Calculating ceteris paribus: 100%|██████████████████████████████████████████████| 9/9 [00:00<00:00, 26.17it/s]

For the next plots, I want to find the most representative instances of people with and without Monkey Pox.¶

For this I use k-medoids clustering (KMedoids with a single cluster) and take the medoid, i.e. the sample closest to the cluster center.¶

Clustering the full dataset directly does not reproduce the target labels, so we first split the dataset by target and then find the medoid of each subset¶

In [19]:
df_list = [d for _, d in df_enc.groupby(['MonkeyPox'])]   
medoids = {}
for d in df_list:
    kmedoids = KMedoids(n_clusters=1, random_state=17)
    kmedoids.fit(d.drop('MonkeyPox', axis=1))
    medoids[d['MonkeyPox'].iloc[0]] = kmedoids.cluster_centers_[0]
In [20]:
medoids
Out[20]:
{0.0: array([2., 0., 0., 0., 0., 0., 0., 0., 0.]),
 1.0: array([1., 1., 1., 1., 1., 1., 1., 1., 1.])}

From these plots we can see which attributes contribute to the prediction for "the healthiest" and "the most infected" patients, and how¶

In [21]:
for key in medoids:
    exp.predict_parts(medoids[key], type='break_down_interactions').plot()

And similar plots, which you may find useful¶

In [22]:
for key in medoids:
    print(f'Breakdown for the most representative sample of target {key}')
    exp.predict_parts(medoids[key], type='shap').plot()
Breakdown for the most representative sample of target 0.0
Breakdown for the most representative sample of target 1.0

Scoring / Tuning¶

Finally, let's test the Random Forest on this dataset¶

I don't use accuracy here because the data is imbalanced; ROC AUC is a better choice. I also use 5-fold cross-validation for a more reliable estimate¶

In [23]:
score = cross_val_score(RandomForestClassifier(n_jobs=-1, random_state=17), 
                        x, y, scoring='roc_auc',
                        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=17))
In [24]:
print(f'ROC AUC of the model: {np.mean(score)}')
ROC AUC of the model: 0.6682451345618627

I think it is also worth looking at recall, because in our case it is less harmful to wrongly flag an uninfected person as having Monkey Pox than to miss an infected one.¶

For example, suppose we have 8 healthy people and 2 infected. If we predict everyone as infected we get only 20% accuracy, but follow-up diagnostics would clear the 8 healthy people while the 2 infected get treatment. If instead we predict everyone as healthy we get 80% accuracy, but the 2 infected people receive no treatment, can infect others, and risk complications later¶
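The toy scenario above can be checked directly with sklearn's metrics:¶

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 8 + [1] * 2   # 8 healthy patients, 2 infected

all_infected = [1] * 10      # flag everyone as infected
all_healthy = [0] * 10       # flag everyone as healthy

print(accuracy_score(y_true, all_infected), recall_score(y_true, all_infected))  # 0.2 1.0
print(accuracy_score(y_true, all_healthy), recall_score(y_true, all_healthy))    # 0.8 0.0
```

The "predict everyone as infected" strategy loses on accuracy but achieves perfect recall, which is exactly the trade-off described above.¶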

In [25]:
score = cross_val_score(RandomForestClassifier(n_jobs=-1, random_state=17), 
                        x, y, scoring='recall',
                        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=17))
In [26]:
print(f'Recall of the model: {np.mean(score)}')
Recall of the model: 0.8597030095608135

Finally, I tune the hyperparameters with GridSearchCV to find the best configuration for this dataset¶

In [27]:
params = {'n_estimators': [10, 50, 100, 150],
          'max_features': [3, 6, 9],
          'min_samples_leaf': [1, 3, 5, 7],
          'max_depth': [3, 5, 10, 15, 20]}

gr = GridSearchCV(estimator=RandomForestClassifier(n_jobs=-1, random_state=17),
                  param_grid=params,
                  cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=17),
                  scoring='recall',
                  n_jobs=-1)

gr.fit(x, y)
Out[27]:
GridSearchCV(cv=StratifiedKFold(n_splits=3, random_state=17, shuffle=True),
             estimator=RandomForestClassifier(n_jobs=-1, random_state=17),
             n_jobs=-1,
             param_grid={'max_depth': [3, 5, 10, 15, 20],
                         'max_features': [3, 6, 9],
                         'min_samples_leaf': [1, 3, 5, 7],
                         'n_estimators': [10, 50, 100, 150]},
             scoring='recall')
In [28]:
print(f'Best cross-validated recall: {gr.best_score_}')
print(gr.best_params_)
Best cross-validated recall: 0.9792570243258534
{'max_depth': 3, 'max_features': 3, 'min_samples_leaf': 1, 'n_estimators': 100}
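As a usage note: with refit=True (the GridSearchCV default) the search object already retrains the winning configuration on the full data, so the tuned model can be used directly via best_estimator_. A self-contained sketch on synthetic data (the real winning parameters are the ones printed above):¶

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_toy, y_toy = make_classification(n_samples=300, random_state=17)
search = GridSearchCV(RandomForestClassifier(random_state=17),
                      param_grid={'max_depth': [3, 5], 'n_estimators': [10, 50]},
                      cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=17),
                      scoring='recall')
search.fit(X_toy, y_toy)

best_model = search.best_estimator_   # already refitted on all of X_toy, y_toy
print(search.best_params_)
print(best_model.predict(X_toy[:5]))  # ready for predictions immediately
```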